White Wine Quality by Diogo Cosin Ayres de Oliveira

This report explores a dataset containing information about 4898 white wines provided by Cortez et al. (2009). Each observation is described by its chemical properties and a quality rating given by experts.

First, some numbers about the dataset.

## [1] 4898   13

The dataset contains 4898 observations with 13 attributes, including one for the index and another for the quality grade.

Below are the attribute names and the summary statistics for these 13 attributes.

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000
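The output above is the kind of result produced by `dim()`, `names()` and `summary()` on a data frame. As a minimal sketch, assuming the data is read from a CSV file: the file path, the reduced set of columns and the values below are hypothetical stand-ins, not the real dataset.

```r
# Hypothetical stand-in for the real file: a tiny CSV written to a temp path.
# The values are made up for illustration only.
csv_path <- tempfile(fileext = ".csv")
toy <- data.frame(
  X             = 1:4,
  fixed.acidity = c(7.0, 6.3, 8.1, 7.2),
  alcohol       = c(8.8, 9.5, 10.1, 9.9),
  quality       = c(6L, 5L, 7L, 6L)
)
write.csv(toy, csv_path, row.names = FALSE)

# Load the file and inspect it the same way the report does.
df_ww <- read.csv(csv_path)
dims <- dim(df_ww)       # number of rows, number of columns
print(dims)
print(names(df_ww))      # attribute names
print(summary(df_ww))    # min, quartiles, mean and max per attribute
```

With the real file, `dims` would report the 4898 rows and 13 columns shown above.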

Univariate Plots Section

Wine Quality

Each wine is rated on a scale from 0 (very bad) to 10 (excellent) by at least 3 wine experts.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.878   6.000   9.000

The histogram of the quality attribute shows an approximately normal distribution. Most wines are rated around 6. The maximum quality observed is 9, while the minimum is 3. No wine received a perfect 10, and only a few received a 9. Throughout this EDA report, let’s try to understand which factors produce better wines, according to the experts’ opinions.
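Since quality is a discrete grade, its distribution can be tabulated directly before plotting. The sketch below uses a small synthetic vector of grades (an assumption for illustration, not the real ratings) and counts every grade on the 0 to 10 scale, so that grades nobody received show up as zeros.

```r
# Synthetic grades for illustration only.
quality <- c(3, 5, 5, 5, 6, 6, 6, 6, 6, 7, 7, 8, 9)

# Fixing the factor levels to 0:10 keeps unused grades in the table.
counts <- table(factor(quality, levels = 0:10))
print(counts)

# Most frequent grade.
modal_grade <- as.integer(names(counts)[which.max(counts)])

# Bar chart of the counts; pdf(NULL) draws without creating a file.
pdf(NULL)
barplot(counts, xlab = "Quality grade", ylab = "Number of wines")
invisible(dev.off())
```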

Acidity Attributes

Now let’s visualize the acidity-related chemical properties of the wines in our dataset.

The acidity attributes look approximately normal, but each has a long right tail: a number of outlying high values push the maximum far away from the median. Since most wines sit near the median of each attribute, could the sparsely populated bins correspond to higher-quality wines? The multivariate analysis may reveal some relationship between these attributes and wine quality.

Sulfur and Sulphate Attributes

Continuing the univariate exploration, let’s plot the histograms of the sulfur and sulphate attributes.

Again, as we have seen for the acidity attributes, most wines have similar sulfur and sulphate characteristics, since the values cluster around the median. Therefore, as the quality distribution is also approximately normal, we may expect wines with atypical attribute values to be rated better than those near the median. We will test this in the multivariate analysis.

pH and Alcohol

Let’s plot the related pH and alcohol attributes.

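Through the plots we can see that the pH distribution is approximately normal, while the alcohol distribution is right-skewed: its mean (10.51) lies above its median (10.40) and its tail extends toward the maximum of 14.2. It is interesting that the two distributions have different shapes, since I expected alcohol content to be closely related to pH. Different marginal shapes do not rule out a relationship by themselves, but they are a first hint that the two attributes are not tightly linked.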
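The direction of a skew can also be checked numerically rather than by eye. A minimal sketch, assuming the moment coefficient of skewness (third central moment divided by the second central moment raised to 3/2) as the measure; the helper name `sample_skewness` and the two small vectors are hypothetical. A positive value means a longer right tail, a negative value a longer left tail.

```r
# Moment coefficient of skewness: m3 / m2^(3/2).
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

symmetric    <- c(1, 2, 3, 4, 5)       # no tail on either side
right_tailed <- c(1, 1, 2, 2, 3, 10)   # one large value pulls the tail right

print(sample_skewness(symmetric))
print(sample_skewness(right_tailed))
```

A quicker but rougher check is to compare mean and median: a mean clearly above the median usually goes with a right skew.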

Residual Sugar, chlorides and density

Finally, let’s explore the distributions of the remaining attributes: residual sugar, chlorides and density.

We can see that all of these attributes present an approximately normal distribution, with the exception of residual sugar, which is highly right-skewed (median 5.2, mean 6.39, maximum 65.8). In order to see how residual sugar behaves in other regions, let’s redraw the plot with a log scale on the x-axis.

With a log10 x scale it’s possible to notice a bimodal distribution: most wines have residual sugar either between 1 and 2 or between 7 and 15, which suggests two groups of wines, one with low and one with noticeably higher sugar content.
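The log-scale view can be sketched in base R by binning `log10()` of the values and labelling the axis in the original units. The data below is synthetic (a mixture of two log-normal groups chosen only to mimic a bimodal shape), and the tick positions are an arbitrary choice.

```r
set.seed(1)
# Synthetic bimodal sample: one group around 1.5, another around 10.
residual_sugar <- c(rlnorm(300, meanlog = log(1.5), sdlog = 0.25),
                    rlnorm(200, meanlog = log(10),  sdlog = 0.30))

# Bin on the log10 scale without drawing yet.
h <- hist(log10(residual_sugar), breaks = 30, plot = FALSE)

# Draw the histogram and relabel the x-axis in original units.
ticks <- c(0.5, 1, 2, 5, 10, 20)
pdf(NULL)
plot(h, axes = FALSE, main = "Residual sugar (log10 scale)",
     xlab = "Residual sugar", ylab = "Number of wines")
axis(1, at = log10(ticks), labels = ticks)
axis(2)
invisible(dev.off())
```

On the raw scale the first group would be squeezed into one or two bins, which is why the second mode is hard to see without the transformation.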

Bivariate and Multivariate Plots Section

Correlation Matrix

Before analysing the bivariate plots, let’s compute the correlation matrix using the Pearson method, in order to get initial correlation coefficients between the attributes and to focus our multivariate analysis on the most strongly correlated ones.

##                                 X fixed.acidity volatile.acidity
## X                     1.000000000   -0.25581431      0.002857966
## fixed.acidity        -0.255814305    1.00000000     -0.022697290
## volatile.acidity      0.002857966   -0.02269729      1.000000000
## citric.acid          -0.149899918    0.28918070     -0.149471811
## residual.sugar        0.006623775    0.08902070      0.064286060
## chlorides            -0.045645192    0.02308564      0.070511571
## free.sulfur.dioxide  -0.011928911   -0.04939586     -0.097011939
## total.sulfur.dioxide -0.161979037    0.09106976      0.089260504
## density              -0.185976097    0.26533101      0.027113845
## pH                   -0.115774132   -0.42585829     -0.031915368
## sulphates             0.009807759   -0.01714299     -0.035728147
## alcohol               0.213656245   -0.12088112      0.067717943
## quality               0.035763247   -0.11366283     -0.194722969
##                       citric.acid residual.sugar   chlorides
## X                    -0.149899918    0.006623775 -0.04564519
## fixed.acidity         0.289180698    0.089020701  0.02308564
## volatile.acidity     -0.149471811    0.064286060  0.07051157
## citric.acid           1.000000000    0.094211624  0.11436445
## residual.sugar        0.094211624    1.000000000  0.08868454
## chlorides             0.114364448    0.088684536  1.00000000
## free.sulfur.dioxide   0.094077221    0.299098354  0.10139235
## total.sulfur.dioxide  0.121130798    0.401439311  0.19891030
## density               0.149502571    0.838966455  0.25721132
## pH                   -0.163748211   -0.194133454 -0.09043946
## sulphates             0.062330940   -0.026664366  0.01676288
## alcohol              -0.075728730   -0.450631222 -0.36018871
## quality              -0.009209091   -0.097576829 -0.20993441
##                      free.sulfur.dioxide total.sulfur.dioxide     density
## X                          -0.0119289106         -0.161979037 -0.18597610
## fixed.acidity              -0.0493958591          0.091069756  0.26533101
## volatile.acidity           -0.0970119393          0.089260504  0.02711385
## citric.acid                 0.0940772210          0.121130798  0.14950257
## residual.sugar              0.2990983537          0.401439311  0.83896645
## chlorides                   0.1013923521          0.198910300  0.25721132
## free.sulfur.dioxide         1.0000000000          0.615500965  0.29421041
## total.sulfur.dioxide        0.6155009650          1.000000000  0.52988132
## density                     0.2942104109          0.529881324  1.00000000
## pH                         -0.0006177961          0.002320972 -0.09359149
## sulphates                   0.0592172458          0.134562367  0.07449315
## alcohol                    -0.2501039415         -0.448892102 -0.78013762
## quality                     0.0081580671         -0.174737218 -0.30712331
##                                 pH    sulphates     alcohol      quality
## X                    -0.1157741316  0.009807759  0.21365624  0.035763247
## fixed.acidity        -0.4258582910 -0.017142985 -0.12088112 -0.113662831
## volatile.acidity     -0.0319153683 -0.035728147  0.06771794 -0.194722969
## citric.acid          -0.1637482114  0.062330940 -0.07572873 -0.009209091
## residual.sugar       -0.1941334540 -0.026664366 -0.45063122 -0.097576829
## chlorides            -0.0904394560  0.016762884 -0.36018871 -0.209934411
## free.sulfur.dioxide  -0.0006177961  0.059217246 -0.25010394  0.008158067
## total.sulfur.dioxide  0.0023209718  0.134562367 -0.44889210 -0.174737218
## density              -0.0935914935  0.074493149 -0.78013762 -0.307123313
## pH                    1.0000000000  0.155951497  0.12143210  0.099427246
## sulphates             0.1559514973  1.000000000 -0.01743277  0.053677877
## alcohol               0.1214320987 -0.017432772  1.00000000  0.435574715
## quality               0.0994272457  0.053677877  0.43557472  1.000000000

Through the matrix we can see strong correlations between some attributes. For instance, density and alcohol have a linear correlation of -0.78, and density and residual sugar a linear correlation of 0.84. Quality and alcohol show a more moderate correlation of 0.44, which is nevertheless the largest, in absolute value, between quality and any other attribute.
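A matrix of this size is easier to read when the attribute pairs are ranked by the absolute value of their coefficient. The sketch below does that on a small synthetic data frame. The three columns and the coefficients used to generate `density` are assumptions chosen only to give the example some structure.

```r
set.seed(42)
n <- 200
alcohol        <- rnorm(n, mean = 10.5, sd = 1.2)
residual.sugar <- rexp(n, rate = 1 / 6)
# Synthetic density: falls with alcohol, rises with sugar, plus noise.
density <- 0.994 - 0.002 * (alcohol - 10.5) +
  0.0004 * (residual.sugar - 6) + rnorm(n, sd = 0.0005)
toy <- data.frame(alcohol, residual.sugar, density)

# Pearson correlation matrix.
cm <- cor(toy, method = "pearson")

# Keep each pair once (upper triangle) and sort by |r|.
idx <- which(upper.tri(cm), arr.ind = TRUE)
ranked <- data.frame(a = rownames(cm)[idx[, 1]],
                     b = colnames(cm)[idx[, 2]],
                     r = cm[idx])
ranked <- ranked[order(-abs(ranked$r)), ]
print(ranked)
```

On the real data it would also make sense to drop the index column `X` before calling `cor()`, since a row number carries no chemical meaning.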

Creating Sulphates and Residual Sugar buckets

To help in our multivariate exploration, let’s bucket the residual sugar and sulphates attributes so that we can color other scatter plots using these buckets as a reference.

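# Treat quality as a factor so it can be used as a discrete grouping/color variable.
df_ww$quality <- factor(df_ww$quality)
# Breaks 0.2, 0.4, ..., 1.0: sulphates above 1.0 fall outside every interval and become NA.
df_ww$sulphates.bucket <- cut(df_ww$sulphates, breaks=seq(0.2,1.1,0.2))
# Breaks 0, 3, ..., 12: residual sugar above 12 falls outside every interval and becomes NA.
df_ww$residual.sugar.bucket <- cut(df_ww$residual.sugar, breaks=seq(0,14,3))
str(df_ww$sulphates.bucket)
##  Factor w/ 4 levels "(0.2,0.4]","(0.4,0.6]",..: 2 2 2 1 1 2 2 2 2 2 ...
str(df_ww$residual.sugar.bucket)
##  Factor w/ 4 levels "(0,3]","(3,6]",..: NA 1 3 3 3 3 3 NA 1 1 ...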
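The `NA` entries in the residual sugar buckets come from how `cut()` treats values outside the supplied breaks. The sketch below reproduces the effect on a few illustrative values and shows one possible alternative, an open-ended last bucket built by appending `Inf` to the breaks. That alternative is a suggestion and not what the report's code does.

```r
# Illustrative values, including two above the last break.
residual_sugar <- c(0.6, 1.6, 6.9, 8.5, 12.0, 20.7, 65.8)

# Same breaks as in the report: 0, 3, 6, 9, 12.
breaks_original <- seq(0, 14, 3)
bucket_original <- cut(residual_sugar, breaks = breaks_original)
n_dropped <- sum(is.na(bucket_original))   # values above 12 get no bucket
print(bucket_original)

# Alternative: add an open-ended bucket so no wine is dropped.
breaks_open <- c(breaks_original, Inf)
bucket_open <- cut(residual_sugar, breaks = breaks_open)
print(table(bucket_open))
```

Wines with an `NA` bucket get no color when the bucket is used to color a scatter plot, so the choice of breaks decides which wines appear in those plots.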

Density Exploration

Density is expected to have a strong linear relationship with residual sugar and alcohol, given that both alter the density of the wine. Let’s draw scatter plots in order to visualize these relationships.

As expected, density presents a strong linear correlation with residual sugar and alcohol. Both factors act directly on the density of the liquid: dissolved sugar raises it, while ethanol, being less dense than water, lowers it. It is therefore plausible that these relationships are causal rather than mere correlations.
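One way to quantify how much of the density variation these two attributes explain together is a linear model. This is a sketch on synthetic data, not a fit from the report: the coefficients used to generate `density` are made up, so only the signs of the estimates and the mechanics of `lm()` carry over to the real dataset.

```r
set.seed(7)
n <- 300
alcohol        <- runif(n, min = 8, max = 14)
residual.sugar <- runif(n, min = 1, max = 20)
# Synthetic density: decreases with alcohol, increases with sugar.
density <- 1.002 - 0.0012 * alcohol + 0.0004 * residual.sugar +
  rnorm(n, sd = 0.0005)

# Ordinary least squares with both predictors.
fit   <- lm(density ~ residual.sugar + alcohol)
coefs <- coef(fit)
r2    <- summary(fit)$r.squared

print(coefs)   # sign of each slope gives the direction of the effect
print(r2)      # share of the variance in density explained by the model
```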

pH and Fixed Acidity Exploration

Also, through the matrix, we see a relevant negative correlation (-0.43) between pH and fixed acidity. Again, chemistry supports this relationship, since a higher acid content lowers the pH. Let’s visualize it.

The correlation is confirmed, and we notice that most wines have a pH between 3.0 and 3.3. However, the relationship is not as strong as expected. Perhaps attributes other than fixed acidity are also influencing the pH of the wines.

First, let’s see through a scatter plot how sulphates affect the pH vs fixed acidity distribution.

It is hard to detect any pattern in how sulphates alter a wine’s pH. The previous plot shows no clear tendency.

Proceeding with the pH vs fixed acidity exploration, let’s now color our scatter plot with the residual sugar attribute as reference.

In the previous plot we see a tendency for wines with higher residual sugar to present a lower pH (more acidic). However, this tendency is weak, and we can’t see a clear pattern of how residual sugar influences pH. This suggests that, in addition to fixed acidity, residual sugar may be acting on a wine’s pH, despite the lack of a strong linear correlation.
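Coloring a scatter plot by a bucketed attribute, as in the two plots above, can be sketched in base R by mapping each factor level to a color. The data here is synthetic, and the open-ended last break and the palette are assumptions, not the report's settings.

```r
set.seed(3)
n <- 100
fixed_acidity  <- rnorm(n, mean = 6.9, sd = 0.8)
ph             <- 3.9 - 0.1 * fixed_acidity + rnorm(n, sd = 0.1)
residual_sugar <- rexp(n, rate = 1 / 6)

# Bucket the third attribute; Inf keeps high values in a last bucket.
bucket <- cut(residual_sugar, breaks = c(0, 3, 6, 9, 12, Inf))

# One color per bucket, then one color per point via the factor codes.
palette_cols <- hcl.colors(nlevels(bucket), "viridis")
point_cols   <- palette_cols[as.integer(bucket)]

pdf(NULL)
plot(fixed_acidity, ph, col = point_cols, pch = 19,
     xlab = "Fixed acidity", ylab = "pH")
legend("topright", legend = levels(bucket), col = palette_cols,
       pch = 19, title = "Residual sugar")
invisible(dev.off())
```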

Quality Exploration

Now let’s explore how wine quality relates to some of the attributes. First, as the correlation matrix showed, quality has its strongest linear correlation (0.44) with alcohol. Let’s visualize this relationship in order to confirm it.

Indeed, we see that wines with more alcohol tend to be rated higher by the experts.
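Because quality takes only a few discrete values, a per-grade summary of alcohol is a useful companion to the plot. A minimal sketch on made-up values (the data frame `toy` is hypothetical): it computes the median alcohol content for each grade and draws one box per grade.

```r
# Made-up observations for illustration only.
toy <- data.frame(
  quality = c(4, 5, 5, 5, 6, 6, 6, 7, 7, 8),
  alcohol = c(9.0, 9.2, 9.5, 9.4, 10.1, 10.5, 9.8, 11.4, 12.0, 12.5)
)

# Median alcohol content within each quality grade.
median_alcohol <- tapply(toy$alcohol, toy$quality, median)
print(median_alcohol)

# Pearson correlation between the two attributes.
r <- cor(toy$alcohol, toy$quality)
print(r)

# One box per grade shows the spread as well as the trend.
pdf(NULL)
boxplot(alcohol ~ factor(quality), data = toy,
        xlab = "Quality grade", ylab = "Alcohol")
invisible(dev.off())
```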

This plot shows the interesting strong correlation between alcohol and density. It’s easy to see that wine quality increases as alcohol content increases and density decreases. However, there are some good-quality outliers with low alcohol and high density. Perhaps other factors, such as residual sugar and fixed acidity, also influence wine quality, although they do not show a strong linear correlation with it.

So, in order to explore which other factors make a high-quality wine, let’s produce further plots, replacing density with each of them.

Those outliers are still there with high residual sugar amount and low alcohol. So residual sugar by itself doesn’t explain them.

Let’s see now volatile acidity.

This plot shows an interesting slight tendency: among wines with lower alcohol content, those with low volatile acidity tend to have better quality.

It is really difficult to find a pattern in the influence of total sulfur dioxide on wine quality. As the previous multivariate scatter plots showed, alcohol correlates with wine quality, but this plot does not show clearly how total sulfur dioxide acts on it.

Final Plots and Summary

Plot One

Description One

Through this plot we can see that wine quality correlates positively with alcohol content: quality tends to increase as alcohol increases. Moreover, according to the correlation matrix produced earlier in this report, alcohol is the attribute with the largest correlation with wine quality (0.44).

Plot Two

Description Two

The correlation matrix also identified really strong correlations between density and both alcohol and residual sugar. This is expected, given that alcohol and residual sugar alter the density of the wine for chemical reasons. The previous plots confirm these correlations: density tends to increase as residual sugar increases and to decrease as alcohol increases.

Plot Three

Description Three

Again we see the relevant correlation between alcohol and quality. However, this plot also shows a slight tendency for quality to increase as volatile acidity decreases at lower alcohol levels. This suggests that, besides alcohol, other attributes contribute to wine quality and, in combination, may determine a high-quality wine.


Reflection

The dataset provided by Cortez et al. (2009) contains attributes of 4,898 white wines. With this data collection it was possible to explore how these chemical attributes correlate with each other and how they relate to the quality of a white wine. Each wine was rated by experts on a scale from 0 to 10. Initially, the exploration looked at the attribute distributions with the aid of univariate histograms. Most attributes presented an approximately normal distribution, with residual sugar as the clearest exception. After that, the correlations between the attributes were analyzed through the correlation matrix and bivariate scatter plots. Some chemical concepts were confirmed by really strong correlations, such as density depending on residual sugar and alcohol. Also, still in the bivariate analysis, it was surprising to discover that alcohol produced the highest correlation with wine quality. It was really hard to detect any pattern in the correlation between wine quality and the other attributes, but thanks to the multivariate scatter plots it was possible to notice that other attributes also act on wine quality. Volatile acidity contributes to wine quality, given that low volatile acidity tends to go with higher quality. However, the other multivariate plots failed to show strong patterns or correlations, and it was not possible to draw insights or conclusions from them. As future work, this dataset could be enriched with more attributes and observations so that a quality prediction model could be built using machine learning techniques.

References

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009. ISSN: 0167-9236.